List of packages used:
- caret
- corrplot
- dplyr
- missForest
library(e1071) # to understand skewness
library(dplyr)
library(stringr) # Used to rename the columns by removing the word team from the column header
library(VIM) # To understand NAs
library(caret)
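library(MASS) # Used for robust linear regression (rlm); loaded after dplyr, so it masks dplyr::select()
library(mice) # md.pattern() and imputation of missing values
library(PerformanceAnalytics) # chart.Correlation()
library(psych) # cor.ci()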
# browse to the data
moneyball = read.csv('/Users/legs_jorge/Documents/Data Science Projects/MSDS_Northwestern/MSDS 411/Unit 01 Moneyball Baseball Problem/Data/moneyball.csv', header = T)
colnames(moneyball) <- str_replace_all(colnames(moneyball),"TEAM_","") %>%
tolower() # Fixing column names
Outliers can distort a model’s fit and lead to misleading predictions. Boxplots help identify them, and a Cleveland dotplot adds detail: it plots each observation’s row number against its actual value, which makes patterns among outliers easy to spot. The plot also lets us check the raw data for errors such as typos introduced during the data collection phase. Points on the far right or far left are observed values that are considerably larger or smaller than the majority of the observations, and they require further investigation. Used together with the boxplot and histogram, this chart shows where in the data the outliers occur.
par(mfrow = c(1, 3))
# Columns 2 to 17 hold the target and the predictors; column 1 is the index
for (i in 2:17) {
  plot(moneyball[,i], moneyball$index, xlab = colnames(moneyball)[i], ylab = "Index", main = paste("Cleveland dotplot of", colnames(moneyball)[i]))
  boxplot(moneyball[,i], col = "#A71930", main = paste("Boxplot of", colnames(moneyball)[i]))
  hist(
    moneyball[,i],
    col = "#A71930",
    xlab = colnames(moneyball)[i],
    main = paste("Histogram of", colnames(moneyball)[i])
  )
}
It looks like the outliers are legitimate, so we will try two techniques to deal with them:
1. Use robust linear regression.
2. Use the spatial sign transformation after centering and scaling the data.
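To make the second technique concrete, here is a minimal base-R sketch of what a spatial sign transformation does: after centering and scaling, every row is divided by its Euclidean length, so extreme observations are pulled onto the unit sphere. The toy values are made up for illustration; in practice caret's `spatialSign()` or `preProcess(method = "spatialSign")` would do the same job.

```r
# Toy data with one extreme row; the values are invented for illustration.
x <- data.frame(
  a = c(1, 2, 3, 4, 100),
  b = c(2, 1, 4, 3, 90)
)

scaled <- scale(x)                    # centre each column to mean 0, scale to sd 1
row_norm <- sqrt(rowSums(scaled^2))   # Euclidean length of each row
spatial_sign <- scaled / row_norm     # divide every row by its own length

# Every row now has length 1, so the extreme row can no longer dominate.
sqrt(rowSums(spatial_sign^2))
```

Because the transformation works on whole rows, it only makes sense once the columns are on a common scale, which is why centering and scaling come first.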
Now that step one is done, let’s look at step 2.
From the histograms above we can see that most variables are not normally distributed, although a few roughly follow a normal distribution. Let’s use a QQ-plot to check each column for normality, alongside a histogram and a skewness value.
- If skewness is less than −1 or greater than +1, the distribution is highly skewed.
- If skewness is between −1 and −½ or between +½ and +1, the distribution is moderately skewed.
- If skewness is between −½ and +½, the distribution is approximately symmetric.
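The rule of thumb above can be written as a small helper. This is only a sketch: `classify_skewness` is a hypothetical name, and treating a value of exactly ±½ as "moderately skewed" is an assumption, since the rule does not say which side the boundary belongs to.

```r
# Hypothetical helper: map a skewness value to the rule-of-thumb labels.
classify_skewness <- function(s) {
  if (is.na(s)) return(NA_character_)
  if (abs(s) > 1) {
    "highly skewed"
  } else if (abs(s) >= 0.5) {   # boundary handling is an assumption
    "moderately skewed"
  } else {
    "approximately symmetric"
  }
}

# Example values chosen to hit each branch.
sapply(c(-1.7, -0.6, 0.2, 0.9, 2.3), classify_skewness)
```

It could be paired with `skewness(x, na.rm = TRUE)` from e1071 to label each column.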
par(mfrow = c(2, 2))
for (i in 2:17) {
  qqnorm(moneyball[,i], main = paste("QQ-Plot of", colnames(moneyball)[i]))
  qqline(moneyball[,i], col = 2)
  hist(
    moneyball[,i],
    col = "#A71930",
    xlab = colnames(moneyball)[i],
    # na.rm = TRUE so that columns with missing values still report a skewness
    main = paste0("Skewness = ", round(skewness(moneyball[,i], na.rm = TRUE), 2))
  )
}
We will need to try some transformations to correct for skewness, with Box-Cox as the first choice.
R gives us many ways to understand the distribution of missing values (NAs) within the data. Let’s first calculate the percentage of NA values in each column relative to the total number of observations.
NAPerc <-
sapply(moneyball, function(x)
(sum(is.na(x)) / length(x)) * 100) %>%
data.frame()
NAPerc$Column <- rownames(NAPerc)
colnames(NAPerc) <- c("NA_Perc", "Col_Name")
# Trying to understand the percentage of NAs per Column
NA_col <- subset(NAPerc, NA_Perc > 0) %>% arrange(desc(NA_Perc))
NA_col
NA_subset <- moneyball[, c(NA_col$Col_Name)]
matrixplot(NA_subset, labels = TRUE, interactive = TRUE)
Click in a column to sort by the corresponding variable.
To regain use of the VIM GUI and the R console, click outside the plot region.
Let’s look at the pattern of missing data to try to get more insights. It’s clear that TEAM_BATTING_HBP is going to be a problematic column with 92% of the data missing. Before we start the imputation, let’s try to understand why we have missing data.
There are two types of missing data
Let’s use the mice package to help us understand how the NAs behave in the data. mice provides a handy function called md.pattern that shows the pattern of missing data. Hopefully, by looking at the pattern, we can get an idea of why the data could be missing.
md.pattern(moneyball) %>% data.frame()
The first column of the output shows the number of observations that share each missing-data pattern. There are 191 observations with no missing values, and 1295 observations that are complete except for the variable batting_hbp. The rightmost column shows the number of variables with missing values in a given pattern; for example, the first row has no missing values, so its entry is “0”. The last row counts the missing values for each variable: pitching_bb contains no missing values, while batting_so contains 102. This table is helpful when deciding whether to drop observations whose number of missing variables exceeds a preset threshold.
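The counts in such a pattern table can be cross-checked without mice. The sketch below uses only base R and the built-in `airquality` data as a stand-in (an assumption, since the moneyball file is not available here): it labels every row with the set of variables it is missing, then tabulates the labels.

```r
# airquality ships with R and has NAs in Ozone and Solar.R.
data(airquality)

miss <- is.na(airquality)   # TRUE where a value is missing

# Label each row by the variables it is missing, or "complete" if none.
pattern <- apply(miss, 1, function(r) {
  if (any(r)) paste(colnames(miss)[r], collapse = "+") else "complete"
})

pattern_counts <- sort(table(pattern), decreasing = TRUE)  # rows per pattern
missing_per_col <- colSums(miss)                           # NAs per variable

pattern_counts
missing_per_col
```

`pattern_counts` plays the role of the first column of the pattern table, and `missing_per_col` the role of its last row.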
After some reading, it looks like that column could be translated into a 0 or 1, but I need to investigate further. In my working experience, columns with a high volume of NAs often carry important information, simply because they may capture rare instances where a process fails. Instead of deleting it, I will try to transform it into a categorical variable with 1s and 0s. For the others, I will try other imputation methods such as KNN, mean, and median.
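As a point of comparison for the simpler options, median imputation can be sketched in a few lines of base R. The data frame and its values are invented for illustration, and `impute_median` is a hypothetical helper name.

```r
# Toy data; column names mirror the dataset, values are made up.
toy <- data.frame(
  batting_so = c(842, NA, 1075, 917, NA),
  baserun_sb = c(NA, 60, 54, 101, 72)
)

# Hypothetical helper: replace NAs with the median of the observed values.
impute_median <- function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE)
  x
}

toy_imputed <- as.data.frame(lapply(toy, impute_median))
toy_imputed
```

Filling every gap with the same value shrinks the variance of the column, which is one reason to compare this against KNN or model-based imputation before settling on it.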
Let’s create a series of scatter plots to understand how each independent variable interacts with the dependent variable. These scatter plots will help us spot any infringement of the assumptions needed to develop a robust OLS model, namely multicollinearity.
chart.Correlation(moneyball, histogram = TRUE, pch = 1, method = c("pearson"))
In the above plot:
- The distribution of each variable is shown on the diagonal.
- Below the diagonal are the bivariate scatter plots with a fitted line.
- Above the diagonal are the correlation values, with the significance level shown as stars.
- Each significance level is associated with a symbol: p-values (0, 0.001, 0.01, 0.05, 0.1, 1) <=> symbols ("***", "**", "*", ".", " ").
As we go across the second row, we notice that variables aren’t strongly correlated to our target variable. One can also notice correlation numbers up to 0.97 among our independent variables. Those variables will pose a problem when you include them in a model.
The caret package offers findCorrelation(), which takes the correlation matrix as input and flags the fields causing multicollinearity based on a threshold, the cutoff parameter. It returns a vector of column indices that would need to be removed from our dataset to reduce pairwise correlation.
paste0("Need to exclude ", colnames(moneyball)[findCorrelation(cor(moneyball))])
[1] "Need to exclude batting_hr"
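The call above relies on the default cutoff of 0.9. A minimal sketch with the cutoff spelled out is shown below; the toy data frame is invented, and using `use = "pairwise.complete.obs"` is an assumption about how one might keep missing values from turning the correlation matrix into NAs.

```r
library(caret)

# Toy data: x2 is almost a copy of x1, x3 is independent noise.
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.05),
  x3 = rnorm(200)
)

# Pairwise-complete correlations; harmless here because nothing is missing.
cor_mat <- cor(toy, use = "pairwise.complete.obs")

# names = TRUE returns column names instead of positions.
to_drop <- findCorrelation(cor_mat, cutoff = 0.9, names = TRUE)
to_drop
```

Only one column of the highly correlated pair is flagged, so dropping it leaves the other in the model.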
We will need to revisit this after we have imputed/taken care of Nulls.
Before we start scaling, centering, or applying any other modification to the dataset, let’s make sure we have taken care of the missing values. Let’s first focus on the column with the highest volume of NAs, batting_hbp.
For this column we will try to transform it into a binary variable of 1s and 0s. I don’t understand baseball, but it seems that this variable is not missing at random, so we could simply say hit by pitch or not. I will also see whether imputation using the mice package helps in any way.
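One way to build such a flag is sketched below on a toy column. The name `batting_hbp_bi` matches the column that appears in the later summary; coding 1 for "value recorded" and 0 for "missing" is an assumption, chosen because it agrees with a flag whose mean is small when most values are missing.

```r
# Toy stand-in for the real column; values are made up.
toy <- data.frame(batting_hbp = c(NA, 49, NA, NA, 61, NA))

# 1 when a value was recorded, 0 when it is missing.
toy$batting_hbp_bi <- as.integer(!is.na(toy$batting_hbp))

toy
mean(toy$batting_hbp_bi)   # share of rows with a recorded value
```

Once the flag exists, the original column can be dropped, which is what the `subset()` call does.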
moneyball_trans <- subset(moneyball_trans, select = -c(batting_hbp)) # Dropping the variable we just transformed.
Error in eval(expr, envir, enclos) : object 'batting_hbp' not found
Now that this variable is taken care of, let’s start imputing missing values using mice. Since we only have numeric values, mice will automatically choose PMM (predictive mean matching).
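The imputation call itself is not shown, so here is a minimal sketch of what a PMM run with mice can look like. It uses the built-in `airquality` data as a stand-in, and the values of `m` and `seed` are arbitrary assumptions rather than the settings used in this analysis.

```r
library(mice)

data(airquality)   # stand-in data with NAs in Ozone and Solar.R

# Five imputed datasets via predictive mean matching; seed fixed for reproducibility.
imp <- mice(airquality, m = 5, method = "pmm", seed = 123, printFlag = FALSE)

# Extract the first completed dataset.
airquality_complete <- complete(imp, 1)

colSums(is.na(airquality_complete))
```

PMM fills each gap with an observed value borrowed from a similar row, so imputed values always stay within the range of the real data.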
Now that we have imputed the data, let’s take a quick summary to see what it looks like.
summary(moneyball_trans)
index target_wins batting_h batting_2b batting_3b batting_hr batting_bb
Min. : 1.0 Min. : 0.00 Min. : 891 Min. : 69.0 Min. : 0.00 Min. : 0.00 Min. : 0.0
1st Qu.: 630.8 1st Qu.: 71.00 1st Qu.:1383 1st Qu.:208.0 1st Qu.: 34.00 1st Qu.: 42.00 1st Qu.:451.0
Median :1270.5 Median : 82.00 Median :1454 Median :238.0 Median : 47.00 Median :102.00 Median :512.0
Mean :1268.5 Mean : 80.79 Mean :1469 Mean :241.2 Mean : 55.25 Mean : 99.61 Mean :501.6
3rd Qu.:1915.5 3rd Qu.: 92.00 3rd Qu.:1537 3rd Qu.:273.0 3rd Qu.: 72.00 3rd Qu.:147.00 3rd Qu.:580.0
Max. :2535.0 Max. :146.00 Max. :2554 Max. :458.0 Max. :223.00 Max. :264.00 Max. :878.0
batting_so baserun_sb baserun_cs pitching_h pitching_hr pitching_bb pitching_so
Min. : 0.0 Min. : 0 Min. : 0 Min. : 1137 Min. : 0.0 Min. : 0.0 Min. : 0.0
1st Qu.: 545.0 1st Qu.: 67 1st Qu.: 42 1st Qu.: 1419 1st Qu.: 50.0 1st Qu.: 476.0 1st Qu.: 611.8
Median : 732.0 Median :106 Median : 57 Median : 1518 Median :107.0 Median : 536.5 Median : 803.0
Mean : 728.2 Mean :136 Mean : 76 Mean : 1779 Mean :105.7 Mean : 553.0 Mean : 810.5
3rd Qu.: 925.0 3rd Qu.:170 3rd Qu.: 90 3rd Qu.: 1682 3rd Qu.:150.0 3rd Qu.: 611.0 3rd Qu.: 957.2
Max. :1399.0 Max. :697 Max. :201 Max. :30132 Max. :343.0 Max. :3645.0 Max. :19278.0
fielding_e fielding_dp batting_hbp_bi
Min. : 65.0 Min. : 52.0 Min. :0.00000
1st Qu.: 127.0 1st Qu.:126.0 1st Qu.:0.00000
Median : 159.0 Median :145.0 Median :0.00000
Mean : 246.5 Mean :141.9 Mean :0.08392
3rd Qu.: 249.2 3rd Qu.:162.0 3rd Qu.:0.00000
Max. :1898.0 Max. :228.0 Max. :1.00000
Let’s test a model to establish a baseline.
mse(stepwise_base_model)
[1] 157.695
mse(base_model)
[1] 157.4534
mse(robust_base_model)
[1] 158.2798
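The code that fits these three models is not shown, so the following is only a sketch of one plausible setup. The model names come from the calls above; the `mse` helper, the use of `stepAIC()` and `rlm()` from MASS, and the `mtcars` stand-in data are all assumptions.

```r
library(MASS)   # stepAIC() and rlm()

# Hypothetical helper: in-sample mean squared error of a fitted model.
mse <- function(model) mean(residuals(model)^2)

data(mtcars)    # stand-in for moneyball_trans

base_model <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
stepwise_base_model <- stepAIC(base_model, direction = "both", trace = FALSE)
robust_base_model <- rlm(mpg ~ wt + hp + qsec + drat, data = mtcars)

sapply(
  list(base = base_model, stepwise = stepwise_base_model, robust = robust_base_model),
  mse
)
```

Measured in-sample, ordinary least squares on the full predictor set cannot have a higher MSE than a stepwise subset or a robust fit on the same predictors, so a fair comparison between the three needs held-out data or cross-validation.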
Let’s use caret’s preProcess function to help us fix the issues we found while exploring the data.
First, we will use Box-Cox to normalize the data.
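A minimal sketch of that step is shown below, using three strictly positive `mtcars` columns as a stand-in for the predictors (an assumption). `preProcess()` estimates the transformation parameters, and `predict()` applies them.

```r
library(caret)

data(mtcars)
predictors <- mtcars[, c("disp", "hp", "wt")]   # strictly positive stand-in columns

# Estimate Box-Cox lambdas, then centre and scale.
pp <- preProcess(predictors, method = c("BoxCox", "center", "scale"))
transformed <- predict(pp, predictors)

colMeans(transformed)          # approximately 0 after centering
apply(transformed, 2, sd)      # approximately 1 after scaling
```

Box-Cox is only defined for strictly positive values, so columns whose minimum is 0 would need a shift or a different transformation such as Yeo-Johnson (`method = "YeoJohnson"`).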
Let’s do a quick analysis to understand the distribution of NA values across our dataset, sorting the fields from most to fewest NAs.
# Let's check for NAs in the data
# Count the NAs per column and compute the percentage of NAs per column
NAPerc <- sapply(moneyball, function(x) (sum(is.na(x))/length(x))*100) %>%
data.frame()
NAPerc$Column <- rownames(NAPerc)
colnames(NAPerc) <- c("NA_Perc", "Col_Name")
subset(NAPerc,NA_Perc > 0) %>% arrange(desc(NA_Perc))
matrixplot(moneyball)
Here is a great article from R-bloggers that discusses the mice package.
cor.ci(moneyball, method="spearman")
Call:corCi(x = x, keys = keys, n.iter = n.iter, p = p, overlap = overlap,
poly = poly, method = method, plot = plot, minlength = minlength)
Coefficients and bootstrapped confidence intervals
INDEX TARGE TEAM_BATTING_H TEAM_BATTING_2 TEAM_BATTING_3 TEAM_BATTING_HR TEAM_BATTING_B
INDEX 1.00
TARGET_WINS -0.01 1.00
TEAM_BATTING_H 0.00 0.37 1.00
TEAM_BATTING_2B 0.02 0.24 0.60 1.00
TEAM_BATTING_3B 0.00 0.12 0.33 -0.16 1.00
TEAM_BATTING_HR 0.05 0.16 0.07 0.44 -0.68 1.00
TEAM_BATTING_BB -0.04 0.23 0.07 0.27 -0.28 0.49 1.00
TEAM_BATTING_SO 0.09 -0.08 -0.40 0.17 -0.72 0.73 0.26
TEAM_BASERUN_SB 0.06 0.11 0.00 -0.16 0.36 -0.42 -0.15
TEAM_BASERUN_CS 0.02 -0.01 0.02 -0.02 0.20 -0.37 -0.17
TEAM_BATTING_HBP 0.06 0.04 -0.03 0.02 -0.17 0.08 0.01
TEAM_PITCHING_H -0.01 0.21 0.75 0.28 0.54 -0.30 -0.14
TEAM_PITCHING_HR 0.05 0.17 0.11 0.46 -0.65 0.98 0.46
TEAM_PITCHING_BB -0.04 0.21 0.15 0.22 -0.10 0.28 0.87
TEAM_PITCHING_SO 0.09 -0.09 -0.37 0.11 -0.60 0.57 0.12
TEAM_FIELDING_E -0.03 -0.12 0.11 -0.38 0.74 -0.81 -0.43
TEAM_FIELDING_DP 0.01 -0.05 0.18 0.25 -0.25 0.39 0.32
TEAM_BATTING_S TEAM_BASERUN_S TEAM_BASERUN_C TEAM_BATTING_HB TEAM_PITCHING_H
INDEX
TARGET_WINS
TEAM_BATTING_H
TEAM_BATTING_2B
TEAM_BATTING_3B
TEAM_BATTING_HR
TEAM_BATTING_BB
TEAM_BATTING_SO 1.00
TEAM_BASERUN_SB -0.11 1.00
TEAM_BASERUN_CS -0.17 0.67 1.00
TEAM_BATTING_HBP 0.17 -0.03 -0.06 1.00
TEAM_PITCHING_H -0.60 0.14 0.02 -0.02 1.00
TEAM_PITCHING_HR 0.69 -0.41 -0.36 0.08 -0.21
TEAM_PITCHING_BB 0.06 -0.02 -0.13 0.01 0.14
TEAM_PITCHING_SO 0.90 -0.06 -0.16 0.18 -0.37
TEAM_FIELDING_E -0.74 0.36 0.21 0.07 0.47
TEAM_FIELDING_DP 0.09 -0.42 -0.14 -0.07 0.04
TEAM_PITCHING_HR TEAM_PITCHING_B TEAM_PITCHING_S
INDEX
TARGET_WINS
TEAM_BATTING_H
TEAM_BATTING_2B
TEAM_BATTING_3B
TEAM_BATTING_HR
TEAM_BATTING_BB
TEAM_BATTING_SO
TEAM_BASERUN_SB
TEAM_BASERUN_CS
TEAM_BATTING_HBP
TEAM_PITCHING_H
TEAM_PITCHING_HR 1.00
TEAM_PITCHING_BB 0.32 1.00
TEAM_PITCHING_SO 0.59 0.04 1.00
TEAM_FIELDING_E -0.77 -0.19 -0.56
TEAM_FIELDING_DP 0.39 0.27 0.01
TEAM_FIELDING_E TEAM_FIELDING_D
TEAM_FIELDING_E 1.00
TEAM_FIELDING_DP -0.37 1.00
scale correlations and bootstrapped confidence intervals
lower.emp lower.norm estimate upper.norm upper.emp p
INDEX-TARGE -0.05 -0.05 -0.01 0.03 0.03 0.55
INDEX-TEAM_BATTING_H -0.05 -0.04 0.00 0.04 0.04 0.97
INDEX-TEAM_BATTING_2 -0.03 -0.03 0.02 0.06 0.06 0.41
INDEX-TEAM_BATTING_3 -0.05 -0.05 0.00 0.05 0.04 1.00
INDEX-TEAM_BATTING_HR 0.01 0.01 0.05 0.09 0.10 0.01
INDEX-TEAM_BATTING_B -0.08 -0.08 -0.04 0.00 0.00 0.07
INDEX-TEAM_BATTING_S 0.05 0.05 0.09 0.13 0.13 0.00
INDEX-TEAM_BASERUN_S 0.03 0.02 0.06 0.11 0.11 0.00
INDEX-TEAM_BASERUN_C -0.03 -0.03 0.02 0.08 0.08 0.39
INDEX-TEAM_BATTING_HB -0.09 -0.09 0.06 0.19 0.20 0.50
INDEX-TEAM_PITCHING_H -0.06 -0.06 -0.01 0.03 0.03 0.60
INDEX-TEAM_PITCHING_HR 0.02 0.01 0.05 0.09 0.09 0.01
INDEX-TEAM_PITCHING_B -0.08 -0.08 -0.04 0.00 0.00 0.06
INDEX-TEAM_PITCHING_S 0.05 0.05 0.09 0.13 0.12 0.00
INDEX-TEAM_FIELDING_E -0.07 -0.07 -0.03 0.01 0.00 0.10
INDEX-TEAM_FIELDING_D -0.03 -0.03 0.01 0.05 0.05 0.48
TARGE-TEAM_BATTING_H 0.33 0.33 0.37 0.40 0.41 0.00
TARGE-TEAM_BATTING_2 0.20 0.20 0.24 0.27 0.27 0.00
TARGE-TEAM_BATTING_3 0.09 0.09 0.12 0.17 0.16 0.00
TARGE-TEAM_BATTING_HR 0.11 0.11 0.16 0.20 0.19 0.00
TARGE-TEAM_BATTING_B 0.18 0.18 0.23 0.27 0.27 0.00
TARGE-TEAM_BATTING_S -0.13 -0.12 -0.08 -0.03 -0.03 0.00
TARGE-TEAM_BASERUN_S 0.07 0.07 0.11 0.16 0.15 0.00
TARGE-TEAM_BASERUN_C -0.05 -0.05 -0.01 0.04 0.03 0.70
TARGE-TEAM_BATTING_HB -0.12 -0.13 0.04 0.19 0.21 0.70
TARGE-TEAM_PITCHING_H 0.17 0.17 0.21 0.26 0.26 0.00
TARGE-TEAM_PITCHING_HR 0.12 0.12 0.17 0.20 0.20 0.00
TARGE-TEAM_PITCHING_B 0.16 0.17 0.21 0.25 0.25 0.00
TARGE-TEAM_PITCHING_S -0.13 -0.14 -0.09 -0.05 -0.05 0.00
TARGE-TEAM_FIELDING_E -0.16 -0.17 -0.12 -0.07 -0.07 0.00
TARGE-TEAM_FIELDING_D -0.09 -0.10 -0.05 -0.01 -0.01 0.02
TEAM_BATTING_H-TEAM_BATTING_2 0.57 0.57 0.60 0.62 0.62 0.00
TEAM_BATTING_H-TEAM_BATTING_3 0.30 0.29 0.33 0.36 0.36 0.00
TEAM_BATTING_H-TEAM_BATTING_HR 0.02 0.02 0.07 0.11 0.11 0.00
TEAM_BATTING_H-TEAM_BATTING_B 0.02 0.03 0.07 0.12 0.12 0.00
TEAM_BATTING_H-TEAM_BATTING_S -0.44 -0.44 -0.40 -0.35 -0.35 0.00
TEAM_BATTING_H-TEAM_BASERUN_S -0.04 -0.04 0.00 0.05 0.05 0.77
TEAM_BATTING_H-TEAM_BASERUN_C -0.03 -0.04 0.02 0.08 0.07 0.53
TEAM_BATTING_H-TEAM_BATTING_HB -0.16 -0.18 -0.03 0.11 0.11 0.65
TEAM_BATTING_H-TEAM_PITCHING_H 0.72 0.72 0.75 0.78 0.78 0.00
TEAM_BATTING_H-TEAM_PITCHING_HR 0.06 0.06 0.11 0.15 0.15 0.00
TEAM_BATTING_H-TEAM_PITCHING_B 0.11 0.11 0.15 0.20 0.19 0.00
TEAM_BATTING_H-TEAM_PITCHING_S -0.41 -0.41 -0.37 -0.33 -0.32 0.00
TEAM_BATTING_H-TEAM_FIELDING_E 0.07 0.07 0.11 0.15 0.15 0.00
TEAM_BATTING_H-TEAM_FIELDING_D 0.13 0.13 0.18 0.22 0.22 0.00
TEAM_BATTING_2-TEAM_BATTING_3 -0.19 -0.19 -0.16 -0.12 -0.12 0.00
TEAM_BATTING_2-TEAM_BATTING_HR 0.40 0.41 0.44 0.48 0.48 0.00
TEAM_BATTING_2-TEAM_BATTING_B 0.23 0.23 0.27 0.31 0.31 0.00
TEAM_BATTING_2-TEAM_BATTING_S 0.12 0.12 0.17 0.21 0.22 0.00
TEAM_BATTING_2-TEAM_BASERUN_S -0.21 -0.20 -0.16 -0.12 -0.12 0.00
TEAM_BATTING_2-TEAM_BASERUN_C -0.09 -0.08 -0.02 0.03 0.03 0.41
TEAM_BATTING_2-TEAM_BATTING_HB -0.10 -0.12 0.02 0.15 0.16 0.80
TEAM_BATTING_2-TEAM_PITCHING_H 0.24 0.24 0.28 0.32 0.32 0.00
TEAM_BATTING_2-TEAM_PITCHING_HR 0.42 0.42 0.46 0.49 0.49 0.00
TEAM_BATTING_2-TEAM_PITCHING_B 0.18 0.18 0.22 0.26 0.26 0.00
TEAM_BATTING_2-TEAM_PITCHING_S 0.07 0.07 0.11 0.15 0.15 0.00
TEAM_BATTING_2-TEAM_FIELDING_E -0.41 -0.41 -0.38 -0.34 -0.34 0.00
TEAM_BATTING_2-TEAM_FIELDING_D 0.21 0.21 0.25 0.29 0.29 0.00
TEAM_BATTING_3-TEAM_BATTING_HR -0.70 -0.70 -0.68 -0.65 -0.65 0.00
TEAM_BATTING_3-TEAM_BATTING_B -0.33 -0.33 -0.28 -0.25 -0.25 0.00
TEAM_BATTING_3-TEAM_BATTING_S -0.74 -0.74 -0.72 -0.70 -0.71 0.00
TEAM_BATTING_3-TEAM_BASERUN_S 0.31 0.31 0.36 0.40 0.39 0.00
TEAM_BATTING_3-TEAM_BASERUN_C 0.14 0.14 0.20 0.25 0.24 0.00
TEAM_BATTING_3-TEAM_BATTING_HB -0.33 -0.32 -0.17 -0.05 -0.05 0.01
TEAM_BATTING_3-TEAM_PITCHING_H 0.50 0.50 0.54 0.57 0.57 0.00
TEAM_BATTING_3-TEAM_PITCHING_HR -0.67 -0.67 -0.65 -0.62 -0.62 0.00
TEAM_BATTING_3-TEAM_PITCHING_B -0.14 -0.14 -0.10 -0.06 -0.06 0.00
TEAM_BATTING_3-TEAM_PITCHING_S -0.63 -0.63 -0.60 -0.57 -0.57 0.00
TEAM_BATTING_3-TEAM_FIELDING_E 0.72 0.72 0.74 0.76 0.76 0.00
TEAM_BATTING_3-TEAM_FIELDING_D -0.30 -0.30 -0.25 -0.21 -0.21 0.00
TEAM_BATTING_HR-TEAM_BATTING_B 0.45 0.46 0.49 0.53 0.53 0.00
TEAM_BATTING_HR-TEAM_BATTING_S 0.71 0.71 0.73 0.75 0.75 0.00
TEAM_BATTING_HR-TEAM_BASERUN_S -0.46 -0.46 -0.42 -0.38 -0.39 0.00
TEAM_BATTING_HR-TEAM_BASERUN_C -0.42 -0.42 -0.37 -0.33 -0.33 0.00
TEAM_BATTING_HR-TEAM_BATTING_HB -0.07 -0.07 0.08 0.23 0.23 0.30
TEAM_BATTING_HR-TEAM_PITCHING_H -0.34 -0.35 -0.30 -0.26 -0.26 0.00
TEAM_BATTING_HR-TEAM_PITCHING_HR 0.97 0.97 0.98 0.98 0.98 0.00
TEAM_BATTING_HR-TEAM_PITCHING_B 0.25 0.25 0.28 0.32 0.32 0.00
TEAM_BATTING_HR-TEAM_PITCHING_S 0.54 0.53 0.57 0.61 0.61 0.00
TEAM_BATTING_HR-TEAM_FIELDING_E -0.82 -0.82 -0.81 -0.80 -0.80 0.00
TEAM_BATTING_HR-TEAM_FIELDING_D 0.34 0.34 0.39 0.43 0.43 0.00
TEAM_BATTING_B-TEAM_BATTING_S 0.22 0.22 0.26 0.30 0.30 0.00
TEAM_BATTING_B-TEAM_BASERUN_S -0.19 -0.20 -0.15 -0.11 -0.11 0.00
TEAM_BATTING_B-TEAM_BASERUN_C -0.22 -0.22 -0.17 -0.12 -0.12 0.00
TEAM_BATTING_B-TEAM_BATTING_HB -0.15 -0.15 0.01 0.16 0.16 0.96
TEAM_BATTING_B-TEAM_PITCHING_H -0.19 -0.20 -0.14 -0.10 -0.10 0.00
TEAM_BATTING_B-TEAM_PITCHING_HR 0.43 0.43 0.46 0.50 0.50 0.00
TEAM_BATTING_B-TEAM_PITCHING_B 0.85 0.85 0.87 0.89 0.89 0.00
TEAM_BATTING_B-TEAM_PITCHING_S 0.08 0.07 0.12 0.16 0.15 0.00
TEAM_BATTING_B-TEAM_FIELDING_E -0.46 -0.46 -0.43 -0.39 -0.39 0.00
TEAM_BATTING_B-TEAM_FIELDING_D 0.28 0.27 0.32 0.36 0.36 0.00
TEAM_BATTING_S-TEAM_BASERUN_S -0.16 -0.16 -0.11 -0.07 -0.08 0.00
TEAM_BATTING_S-TEAM_BASERUN_C -0.21 -0.22 -0.17 -0.13 -0.13 0.00
TEAM_BATTING_S-TEAM_BATTING_HB 0.01 0.03 0.17 0.31 0.32 0.02
TEAM_BATTING_S-TEAM_PITCHING_H -0.63 -0.63 -0.60 -0.57 -0.57 0.00
TEAM_BATTING_S-TEAM_PITCHING_HR 0.67 0.67 0.69 0.71 0.71 0.00
TEAM_BATTING_S-TEAM_PITCHING_B 0.02 0.02 0.06 0.10 0.10 0.00
TEAM_BATTING_S-TEAM_PITCHING_S 0.88 0.88 0.90 0.92 0.92 0.00
TEAM_BATTING_S-TEAM_FIELDING_E -0.76 -0.76 -0.74 -0.72 -0.72 0.00
TEAM_BATTING_S-TEAM_FIELDING_D 0.05 0.05 0.09 0.14 0.15 0.00
TEAM_BASERUN_S-TEAM_BASERUN_C 0.65 0.64 0.67 0.70 0.70 0.00
TEAM_BASERUN_S-TEAM_BATTING_HB -0.18 -0.18 -0.03 0.12 0.10 0.67
TEAM_BASERUN_S-TEAM_PITCHING_H 0.10 0.10 0.14 0.19 0.18 0.00
TEAM_BASERUN_S-TEAM_PITCHING_HR -0.44 -0.44 -0.41 -0.37 -0.37 0.00
TEAM_BASERUN_S-TEAM_PITCHING_B -0.07 -0.06 -0.02 0.03 0.03 0.50
TEAM_BASERUN_S-TEAM_PITCHING_S -0.10 -0.10 -0.06 -0.01 -0.01 0.01
TEAM_BASERUN_S-TEAM_FIELDING_E 0.31 0.31 0.36 0.40 0.40 0.00
TEAM_BASERUN_S-TEAM_FIELDING_D -0.46 -0.46 -0.42 -0.37 -0.38 0.00
TEAM_BASERUN_C-TEAM_BATTING_HB -0.20 -0.19 -0.06 0.08 0.05 0.39
TEAM_BASERUN_C-TEAM_PITCHING_H -0.03 -0.03 0.02 0.07 0.07 0.52
TEAM_BASERUN_C-TEAM_PITCHING_HR -0.41 -0.42 -0.36 -0.32 -0.32 0.00
TEAM_BASERUN_C-TEAM_PITCHING_B -0.17 -0.18 -0.13 -0.08 -0.09 0.00
TEAM_BASERUN_C-TEAM_PITCHING_S -0.21 -0.21 -0.16 -0.11 -0.11 0.00
TEAM_BASERUN_C-TEAM_FIELDING_E 0.16 0.16 0.21 0.26 0.26 0.00
TEAM_BASERUN_C-TEAM_FIELDING_D -0.19 -0.19 -0.14 -0.08 -0.07 0.00
TEAM_BATTING_HB-TEAM_PITCHING_H -0.16 -0.17 -0.02 0.12 0.11 0.69
TEAM_BATTING_HB-TEAM_PITCHING_HR -0.07 -0.07 0.08 0.23 0.23 0.29
TEAM_BATTING_HB-TEAM_PITCHING_B -0.15 -0.15 0.01 0.16 0.16 0.95
TEAM_BATTING_HB-TEAM_PITCHING_S 0.01 0.03 0.18 0.31 0.32 0.02
TEAM_BATTING_HB-TEAM_FIELDING_E -0.04 -0.05 0.07 0.21 0.21 0.21
TEAM_BATTING_HB-TEAM_FIELDING_D -0.18 -0.20 -0.07 0.08 0.08 0.38
TEAM_PITCHING_H-TEAM_PITCHING_HR -0.25 -0.25 -0.21 -0.17 -0.17 0.00
TEAM_PITCHING_H-TEAM_PITCHING_B 0.09 0.09 0.14 0.19 0.18 0.00
TEAM_PITCHING_H-TEAM_PITCHING_S -0.41 -0.41 -0.37 -0.33 -0.33 0.00
TEAM_PITCHING_H-TEAM_FIELDING_E 0.42 0.43 0.47 0.51 0.50 0.00
TEAM_PITCHING_H-TEAM_FIELDING_D -0.02 -0.02 0.04 0.08 0.08 0.22
TEAM_PITCHING_HR-TEAM_PITCHING_B 0.28 0.28 0.32 0.36 0.36 0.00
TEAM_PITCHING_HR-TEAM_PITCHING_S 0.55 0.55 0.59 0.62 0.62 0.00
TEAM_PITCHING_HR-TEAM_FIELDING_E -0.78 -0.78 -0.77 -0.75 -0.75 0.00
TEAM_PITCHING_HR-TEAM_FIELDING_D 0.34 0.34 0.39 0.44 0.43 0.00
TEAM_PITCHING_B-TEAM_PITCHING_S 0.00 0.00 0.04 0.09 0.08 0.03
TEAM_PITCHING_B-TEAM_FIELDING_E -0.23 -0.23 -0.19 -0.15 -0.16 0.00
TEAM_PITCHING_B-TEAM_FIELDING_D 0.22 0.22 0.27 0.31 0.30 0.00
TEAM_PITCHING_S-TEAM_FIELDING_E -0.60 -0.60 -0.56 -0.52 -0.52 0.00
TEAM_PITCHING_S-TEAM_FIELDING_D -0.03 -0.03 0.01 0.06 0.06 0.49
TEAM_FIELDING_E-TEAM_FIELDING_D -0.41 -0.41 -0.37 -0.33 -0.33 0.00
## Reference